Toward a Cross-Linguistic Tagset
Abstract
With the spread of large quantities of corpus data, the need has arisen to develop some standard not only for the format of interchange of text (an issue which has already been taken up by the Text Encoding Initiative), but also for any information added in some subsequent stage of (linguistic) enrichment. The research community has much to gain by such standardization since it will enable researchers to effectively access and therefore make optimal use of the results of previous work on a corpus. This paper provides some direction of thought as to the development of a standardized tagset. We focus on a minimal tagset, i.e. a tagset containing information about wordclasses. We investigate what criteria should be met by such a tagset. On the basis of an investigation and comparison of ten different tagsets that have been used over the years for the (wordclass) tagging of corpora, we arrive at a proposal for a cross-linguistic minimal tagset for Germanic languages.
Similar resources
A Common Parts-of-Speech Tagset Framework for Indian Languages
We present a universal Parts-of-Speech (POS) tagset framework covering most of the Indian languages (ILs) following the hierarchical and decomposable tagset schema. In spite of significant number of speakers, there is no workable POS tagset and tagger for most ILs, which serve as fundamental building blocks for NLP research. Existing IL POS tagsets are often designed for a specific language; th...
Tagset Design and Inflected Languages
An experiment designed to explore the relationship between tagging accuracy and the nature of the tagset is described, using corpora in English, French and Swedish. In particular, the question of internal versus external criteria for tagset design is considered, with the general conclusion that external (linguistic) criteria should be followed. Some problems associated with tagging unknown word...
Etiquetario morfosintáctico del SLI para corpus de lengua gallega: aplicación al corpus paralelo TECTRA
In this article we present a complete and normalized morphosyntactic tagset for the annotation of linguistic corpora in Galician. The elaboration of this tagset, designed by the Computational Linguistics Group (SLI) of the University of Vigo, following strictly the EAGLES recommendations (Leech and Wilson, 1996), includes the creation of an intermediate tagset that allows us to establish a corr...
A Support Tool for Tagset Mapping
Many different tagsets are used in existing corpora; these tagsets vary according to the objectives of specific projects (which may be as far apart as robust parsing vs. spelling correction). In many situations, however, one would like to have uniform access to the linguistic information encoded in corpus annotations without having to know the classification schemes in detail. This paper descri...
Part-of-speech Tagset and Corpus Development for Igbo, an African Language
This project aims to develop linguistic resources to support computational NLP research on the Igbo language. The starting point for this project is the development of a new part-of-speech tagging scheme based on the EAGLES tagset guidelines, adapted to incorporate additional language internal features. The tags are currently being used in a part-of-speech annotation task for the development of...